This report explores a dataset containing attributes for 4898 instances of the Portuguese “Vinho Verde” white wine.
The attributes are the following:
The structure of the data:
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## [1] 4898 13
Our dataset contains 4898 observations and 13 variables. The structure shows that all the variables are numeric: X and quality are stored as integers, and the rest as decimal numbers.
## [1] "3" "4" "5" "6" "7" "8" "9"
Here I converted the numerical variable quality to a factor variable.
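The conversion can be sketched as follows. This is a minimal illustration, not the original code of the report: the quality column is rebuilt from the rating counts printed in this report, the names `rating_counts`, `quality_int` and `quality_fct` are hypothetical, and making the factor ordered is my own assumption.

```r
# Counts per rating, copied from the table printed in this report.
rating_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                   "7" = 880, "8" = 175, "9" = 5)

# Rebuild an integer quality column with the same distribution
# (the original row order cannot be recovered from the counts).
quality_int <- rep(as.integer(names(rating_counts)), times = rating_counts)

# Convert to a factor so that each rating is treated as a category.
# ordered = TRUE is an assumption; a plain factor() works as well.
quality_fct <- factor(quality_int, levels = 3:9, ordered = TRUE)

levels(quality_fct)  # the rating levels as character strings
table(quality_fct)   # number of wines per rating
```

With the real data frame, the same idea would be a single assignment to the quality column.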
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
The distribution of quality appears to be roughly normal, and the median and the mean are almost the same. More than 2000 wines have a rating of 6. Since quality is rated between 0 (very bad) and 10 (excellent), most of the wines sit slightly above the midpoint of the scale.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Count of wines by rating. We can clearly see that there are more than 2000 wines rated 6.
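As a quick check of the statements about the mean, the median and the share of wines rated 6 or higher, the summary statistics can be recomputed from the counts alone. This is only a sketch: the variable names are hypothetical, and the quality vector is rebuilt from the printed table rather than read from the data file.

```r
# Counts per rating, copied from the table above.
rating_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                   "7" = 880, "8" = 175, "9" = 5)

# One entry per wine, carrying its rating.
quality <- rep(as.integer(names(rating_counts)), times = rating_counts)

mean_quality   <- mean(quality)       # average rating
median_quality <- median(quality)     # middle rating
share_6_plus   <- mean(quality >= 6)  # fraction of wines rated 6 or higher

print(round(c(mean = mean_quality,
              median = median_quality,
              share_6_plus = share_6_plus), 3))
```

The mean comes out at about 5.88 against a median of 6, and roughly two thirds of the wines are rated 6 or higher.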
Here we have the distribution of the percent alcohol content of the wine. It appears to be slightly skewed with the alcohol peaking at around 9.5.
Sulphates appear to be normally distributed. Sulphates are a wine additive that can contribute to sulfur dioxide gas levels, and sulfur dioxide acts as an antimicrobial and antioxidant.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic). Most of the wines are between 3 and 3.5 on the pH scale.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density appears to be normally distributed across white wines.
Total sulfur dioxide represents the amount of free and bound forms of sulfur dioxide gas. In low concentrations, sulfur dioxide is mostly undetectable in wine, but at free concentrations over 50 mg/liter (ppm), sulfur dioxide becomes evident in the nose and taste of wine. Free sulfur dioxide prevents microbial growth and the oxidation of wine.
This histogram shows the amount of salt in white wines. The majority of wines have less than 0.1 gram/liter.
# Summary statistics (five-number summary plus the mean) of residual sugar
summary(df$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
This plot shows the distribution of the amount of sugar remaining after fermentation stops. Wines with more than 45 grams/liter are considered sweet. Most of our wines have less than 10 grams/liter: the third quartile is 9.9, so about three quarters of the wines fall below that level.
Citric acid adds “freshness” and flavor to wines. Most of our wines have less than 0.5 grams/liter of citric acid.
The volatile acidity histogram shows the distribution of the amount of acetic acid in wines, which at too high levels can lead to an unpleasant, vinegar taste. The peak is at around 0.25.
The fixed acidity histogram shows the tartaric acid of wines.
There are 4898 white wine observations and 13 variables: the row index X and 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality).
The main features in our data set are quality, alcohol %, pH, residual sugar, citric acid and volatile acidity. I suspect that these, in combination with other variables, determine the quality rating.
I also believe that free sulfur dioxide and total sulfur dioxide, in relationship with the rest of the variables, could contribute to my analysis of quality.
There is a moderate positive correlation between alcohol and quality. We can see that as the alcohol increases the rating slightly increases as well.
There is no meaningful relationship between sulphates and quality. This wine additive does not appear to affect the quality rating.
There is no meaningful relationship between pH and quality.
We have a small negative relationship between quality and density. As the density increases, the quality decreases.
From the plot above we see that there is a small negative relationship between total sulfur dioxide and quality. This means that as the total sulfur dioxide increases the quality decreases.
There is no meaningful correlation between free sulfur dioxide and quality.
There is a small negative correlation between chlorides and quality. If the amount of salt increases the quality decreases.
There is no meaningful relationship between residual sugar and quality.
There is no meaningful relationship between citric acid and quality.
There is a small negative correlation between volatile acidity and quality.
There is no clear correlation between fixed acidity and quality.
We observe a strong negative correlation between density and alcohol. As the percent of alcohol increases the density decreases.
We have a moderate negative relationship between alcohol and total sulfur dioxide.
The plot above shows a moderate negative correlation between fixed acidity and pH.
There is a strong positive correlation between density and residual sugar.
There is a moderate positive correlation between density and total sulfur dioxide.
We observe a moderate positive correlation between total sulfur dioxide and free sulfur dioxide.
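The pairwise relationships described above could be quantified along the following lines. This is a hedged sketch rather than the code behind the plots: `cor_with_quality` is a hypothetical helper, and the data frame `toy` holds simulated values (not the wine measurements) that only mimic the direction of some of the reported relationships.

```r
set.seed(42)  # make the simulated data reproducible

# Toy data only: simulated values, NOT the wine data set.
n <- 500
alcohol <- runif(n, min = 8, max = 14)
toy <- data.frame(
  alcohol = alcohol,
  density = 1.02 - 0.002 * alcohol + rnorm(n, sd = 0.001),
  pH      = rnorm(n, mean = 3.2, sd = 0.15),
  quality = pmin(9L, pmax(3L, as.integer(round(0.5 * alcohol + rnorm(n, sd = 1.2)))))
)

# Pearson correlation of every other column with the target column,
# sorted by absolute strength (strongest first).
cor_with_quality <- function(df, target = "quality") {
  predictors <- setdiff(names(df), target)
  r <- sapply(df[predictors], function(col) cor(col, df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

cor_with_quality(toy)          # each feature against quality
cor(toy$alcohol, toy$density)  # one feature against another
```

The helper assumes that quality is still numeric. If it has already been converted to a factor, it would need to be turned back into numbers first, for example with `as.numeric(as.character(x))`.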
Unexpectedly, most of the main features of interest that I listed above (pH, residual sugar and citric acid) have no meaningful relationship with quality. The exceptions are alcohol, which has a moderate positive correlation with quality, and volatile acidity, which has a small negative one.
I have found that there is a small negative correlation between density and quality and between total sulfur dioxide and quality. The amount of salt (chlorides) also has a small negative relationship with quality.
I also found that there are some relationships between the features of wines. For example, I observed a strong negative correlation between density and alcohol. As the percent of alcohol increases the density decreases. There is also a moderate negative relationship between alcohol and total sulfur dioxide. There is a strong positive correlation between density and residual sugar.
The strongest correlations I found are between other features rather than with quality. There is a strong positive correlation between residual sugar and density: as the amount of sugar increases, the density increases. Another strong relationship was observed between density and alcohol: as the percent of alcohol increases, the density decreases.
In this plot we can see that the average alcohol percent is higher for the wines with higher quality rating.
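A comparison of this kind can be computed with a grouped summary. The sketch below uses simulated values in a hypothetical data frame `toy`, in which alcohol is constructed to rise with quality, so it illustrates the technique and does not reproduce the numbers behind the plot.

```r
set.seed(1)  # make the simulated data reproducible

# Toy data only: simulated ratings and alcohol values, NOT the wine data set.
n <- 300
rating <- sample(3:9, n, replace = TRUE)
toy <- data.frame(
  quality = factor(rating, levels = 3:9),
  alcohol = 8 + 0.5 * rating + rnorm(n, sd = 0.8)
)

# Mean alcohol content within each quality rating.
mean_alcohol <- tapply(toy$alcohol, toy$quality, mean)
round(mean_alcohol, 2)

# The same grouping with a different statistic, returned as a data frame.
aggregate(alcohol ~ quality, data = toy, FUN = median)
```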
Here we have a histogram of alcohol faceted by quality. Alcohol appears normally distributed for wines rated above 5. We can see here that wines rated above average peak around 11% alcohol.
In this plot we observe a strong negative relationship between alcohol and density, especially for the wines that are rated above 5 in quality.
Here we observe a moderate negative relationship between alcohol and total sulfur dioxide, faceted by quality, especially for the wines that are rated above 5.
Here I omitted the outlying 1% of the residual sugar data. There is a strong relationship between density and residual sugar for wines rated 5 and above. As the amount of sugar increases the density increases.
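One common way to leave out the most extreme 1% of a variable is to cut at its 99th percentile. The sketch below assumes that this is what was intended and that the omitted values are the largest ones. It runs on simulated values in a hypothetical data frame `toy`, and the original plot may instead have limited the axis range.

```r
set.seed(7)  # make the simulated data reproducible

# Toy data only: a right-skewed stand-in for residual sugar, NOT the wine data set.
toy <- data.frame(residual.sugar = rexp(1000, rate = 1 / 6))
toy$density <- 0.99 + 0.0004 * toy$residual.sugar + rnorm(1000, sd = 0.001)

# 99th percentile of residual sugar; rows above it are dropped.
cutoff  <- quantile(toy$residual.sugar, probs = 0.99)
trimmed <- subset(toy, residual.sugar <= cutoff)

nrow(toy) - nrow(trimmed)      # number of rows removed
range(trimmed$residual.sugar)  # remaining range after trimming
```

Filtering the rows changes any statistic computed afterwards, whereas zooming the axis only changes what is displayed, so the choice between the two depends on whether the extreme wines should still count toward fitted lines or summaries.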
This plot shows pH for each quality in relationship with fixed acidity.
I observed that wines with a higher percent of alcohol are rated higher. Because higher rated wines have more alcohol, they also tend to have a lower density.
Higher rated wines have lower total sulfur dioxide, which fits the observation that sulfur dioxide is mostly undetectable at low concentrations.
Higher rated wines have less sugar than wines in the other rating categories.
[Plot: "Alcohol Distribution by Quality"]
Here we have the distribution of the percent alcohol content of the wine. It appears to be slightly skewed with the alcohol peaking at around 9.5. However, if we look at the higher rated wines, we see that the peak is around 11. So, wines rated above average have a higher percent of alcohol.
There is a small negative relationship between quality and the amount of salt in wine: higher rated wines have less salt than lower rated wines. We can clearly see that the mean salt content decreases as the quality increases.
In this plot we see that most of the wines rated above average are less sweet than the rest of the wines.
I found that wines with a higher quality rating have a higher percent of alcohol, are less sweet and have less salt. I also found other features that have a meaningful relationship with quality, such as density and total sulfur dioxide.
I expected that the main features like pH, residual sugar, citric acid and volatile acidity would have a direct impact on the quality of the wine, but it seems they matter only in combination with other features.
The main challenge I encountered was that the variables were not clearly explained, as they represent chemical properties. We need to understand them in order to present them concisely and clearly, so that the audience knows what we are talking about.
On the other hand, I found a tidy data set which was easy to work with as I didn’t have to struggle with cleaning the data.
Overall I am content with the data set I analyzed and with the insights I found. I didn’t expect that features I had not considered would contribute to the wines’ quality ratings.
For the future I would perform further analyses outside EDA in order to confirm the findings or find some insights that perhaps are not obvious at this stage.